ground truth label
Evaluating multiple models using labeled and unlabeled data
It is difficult to evaluate machine learning classifiers without large labeled datasets, which are often unavailable. In contrast, unlabeled data is plentiful, but not easily used for evaluation. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. The key idea is to estimate the joint distribution of ground truth labels and classifier scores using a semi-supervised mixture model. The semi-supervised mixture model allows SSME to learn from three sources of information: unlabeled data, multiple classifiers, and probabilistic classifier scores. Once fit, the mixture model enables estimation of any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or AUC). We derive theoretical bounds on the error of these estimates, showing that estimation error decreases with the number of classifiers and the amount of unlabeled data. We present experiments in four domains where obtaining large labeled datasets is often impractical: healthcare, content moderation, molecular property prediction, and text classification. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best method.
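The core idea in the abstract above (fit a mixture over classifier scores with a few labels fixed, then read metrics off the fitted posteriors) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not SSME's actual implementation: binary labels, one diagonal Gaussian component per class on logit-transformed scores, a plain EM loop, and the hypothetical helper names `fit_posteriors` and `estimate_accuracy`. The data is synthetic.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Map probabilities to the real line, clipping to avoid infinities."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p) - np.log1p(-p)

def fit_posteriors(s_lab, y_lab, s_unl, n_iter=50):
    """EM for a 2-component diagonal Gaussian mixture on logit scores.

    Labeled rows keep fixed one-hot responsibilities; unlabeled rows are
    re-estimated each iteration. Assumes both classes occur in y_lab.
    Returns P(y | scores) for all rows, labeled rows first.
    """
    X = logit(np.vstack([s_lab, s_unl]))
    n_lab = len(s_lab)
    R = np.zeros((len(X), 2))
    R[:n_lab] = np.eye(2)[y_lab]  # first M-step therefore uses labeled data only
    for _ in range(n_iter):
        # M-step: class priors, means, and per-dimension variances.
        Nk = R.sum(axis=0)
        pi = Nk / Nk.sum()
        mu = (R.T @ X) / Nk[:, None]
        var = (R.T @ X**2) / Nk[:, None] - mu**2 + 1e-6
        # E-step on unlabeled rows only, computed in log space for stability.
        diff = X[n_lab:, None, :] - mu[None, :, :]
        log_norm = np.log(pi) - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
        log_p = log_norm[None, :] - 0.5 * (diff**2 / var[None, :, :]).sum(axis=2)
        log_p -= log_p.max(axis=1, keepdims=True)
        P = np.exp(log_p)
        R[n_lab:] = P / P.sum(axis=1, keepdims=True)
    return R

def estimate_accuracy(scores, R, threshold=0.5):
    """Expected accuracy per classifier: average posterior mass on its prediction."""
    preds = (scores >= threshold).astype(int)
    prob_correct = np.where(preds == 1, R[:, [1]], R[:, [0]])
    return prob_correct.mean(axis=0)

# Synthetic demo: two classifiers, 20 labeled and 1980 unlabeled examples.
rng = np.random.default_rng(0)
n, n_lab = 2000, 20
y = np.arange(n) % 2  # alternating labels, so both classes are labeled
z = (2 * y - 1)[:, None] * 1.5 + rng.normal(size=(n, 2))
scores = 1.0 / (1.0 + np.exp(-z))

R = fit_posteriors(scores[:n_lab], y[:n_lab], scores[n_lab:])
est = estimate_accuracy(scores, R)
true_acc = ((scores >= 0.5).astype(int) == y[:, None]).mean(axis=0)
print("estimated:", est, "true:", true_acc)
```

Any other metric that depends only on scores and labels (AUC, for example) could be estimated the same way by weighting with the posteriors in `R`. The Gaussian components keep the sketch short; a more flexible density model may fit real classifier scores better.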
A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning (Supplementary File)
This supplementary document contains additional experimental details and the technical proofs of the convergence results for the NeurIPS'21 submission entitled "A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning". It is structured as follows. In Appendix A, we provide more experimental details, including the training algorithm, network architecture, optimizer details, loss construction, and training cost of SANE. Appendix B presents the proof and details of the main result in Section 2, namely Theorem 1, which analyzes the generalization performance of MoCo. Next, Appendix C introduces the proof roadmap and details of the main results, i.e.
Weak Supervision Performance Evaluation via Partial Identification
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels, utilizing weak labels from heuristics, crowdsourcing, or pre-trained models. However, the absence of ground truth complicates model evaluation, as traditional metrics such as accuracy, precision, and recall cannot be directly calculated. In this work, we present a novel method to address this challenge by framing model evaluation as a partial identification problem and estimating performance bounds using Fréchet bounds. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques. Through scalable convex optimization, we obtain accurate and computationally efficient bounds for metrics including accuracy, precision, recall, and F1-score, even in high-dimensional settings. This framework offers a robust approach to assessing model quality without ground truth labels, enhancing the practicality of weakly supervised learning for real-world applications.
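To make the partial-identification idea concrete, here is a minimal sketch of the classical two-marginal Fréchet bounds applied to binary accuracy. It is a simplification made for illustration and not the paper's convex-optimization procedure: it assumes only the marginals P(Y = 1) and P(Ŷ = 1) are known, and the function name `frechet_accuracy_bounds` is hypothetical. The joint probability P(Y = 1, Ŷ = 1) must lie in [max(0, p + q − 1), min(p, q)], and accuracy equals 1 − p − q + 2·P(Y = 1, Ŷ = 1).

```python
def frechet_accuracy_bounds(p_y1, p_pred1):
    """Bounds on accuracy given only P(Y=1) and P(Yhat=1) (binary case).

    Uses the Frechet bounds on the joint P(Y=1, Yhat=1):
        max(0, p + q - 1) <= P(Y=1, Yhat=1) <= min(p, q)
    and the identity accuracy = 1 - p - q + 2 * P(Y=1, Yhat=1).
    """
    if not (0.0 <= p_y1 <= 1.0 and 0.0 <= p_pred1 <= 1.0):
        raise ValueError("marginals must be probabilities in [0, 1]")
    joint_lo = max(0.0, p_y1 + p_pred1 - 1.0)
    joint_hi = min(p_y1, p_pred1)
    base = 1.0 - p_y1 - p_pred1
    return base + 2.0 * joint_lo, base + 2.0 * joint_hi

# Example: 30% of examples are positive and the model predicts positive 40% of the time.
lower, upper = frechet_accuracy_bounds(0.3, 0.4)
print(f"accuracy lies in [{lower:.2f}, {upper:.2f}]")
```

With marginals alone the interval is wide. Additional information, such as the weak labels used in PWS, is what would narrow it.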